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ABSTRACT 

We develop methods for accelerating metric similarity search 
that are efltective on modern hardware. Our algorithms fac- 
tor into easily parallelizable components, making them sim- 
ple to deploy and efficient on multicore CPUs and CPUs. 
Despite the simple structure of our algorithms, their search 
performance is provably sublinear in the size of the database, 
with a factor dependent only on its intrinsic dimensional- 
ity. We demonstrate that our methods provide substantial 
speedups on a range of datasets and hardware platforms. In 
particular, we present results on a 48-core server machine, 
on graphics hardware, and on a multicore desktop. 

1. INTRODUCTION 

We study methods to accelerate nearest neighbor search 
on modern parallel systems. We work in the standard metric 
nearest neighbor (NN) setting: given a database X, return 
the closest point to any query q, where closeness is mea- 
sured with some fixed metric. Though the problem setting 
is by now well-studied, there are two recent developments 
that provide a different focus for this work. The first devel- 
opment comes from studies in data analysis and advances 
in theoretical computer science, where the notion of intrin- 
sic dimensionality has been refined and profitably exploited 
to manage high-dimensional data. The second development 
comes from the computer hardware industry: CPUs are vir- 
tually all multicore now and CPUs have rapidly evolved into 
powerful secondary devices for numerical computation, forc- 
ing a renewed interest in parallelism. 

High-dimensional data challenges nearly all methods for 
accelerating NN search. Unfortunately, complicated, high- 
dimensional data is the norm in many domains that rely on 
NN search, such as computer vision [28], bioinformatics [4j, 
and data analysis in general [12]. Because such data must 
be dealt with, researchers have explored data properties that 
might be exploited to accelerate search. A compelling prop- 
erty is the intrinsic dimensionality of a data set; the idea is 
that often data only appears high-dimensional (i.e., each el- 



Permission to make digital or hard copies of all or part of this work for 
personal or classroom use is granted without fee provided that copies are 
not made or distributed for protit or commercial advantage and that copies 
bear this notice and the full citation on the first page. To copy otherwise, to 
republish, to post on servers or to redistribute to lists, requires prior specific 
permission and/or a fee. 

Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00. 



ement has many features), but is actually governed by only 
a few parameters. In recent years, this type of data has been 
studied extensively and is now believed to be widespread; in 
machine learning, for example, several methods have been 
developed to reveal the intrinsic structure of data 26 27 



In research on accelerated NN retrieval, a renewed focus on 
intrinsic dimensionality in recent years has yielded methods 
with strong theoretical guarantees [18| |20| and state of the 
art empirical performance [2]. 

The end goal of all of this work on NN is, of course, to 
make NN search run fast. While properties of data are im- 
portant for computational efficiency, equally important is 
the machine hardware on which search actually runs. Hard- 
ware properties are especially important now, as the ma- 
chines in use for everyday data analysis and database op- 
erations are fundamentally different than they were even 
a decade ago. In particular, standard computer chips are 
multicore with vector (SIMD) units, and CPUs have be- 
come popular as secondary devices for data-intensive pro- 
cessing. These modern processors have very impressive com- 
putational capabilities, but fully exploiting it requires con- 
siderable parallelism. 

The shift towards parallelism in hardware necessitates a 
refocusing in algorithm design and software engineering meth- 
ods [24]. At the level of algorithm design, advances in al- 
gorithms and data structures may not be useful unless they 
can be effectively parallelized. At the level of software de- 
velopment, developing efficient implementations on paral- 
lel systems is challenging, so parallel primitives and design 
patterns are required to ease the burden on programmers. 
For the case of NN search, the modern search algorithms 
discussed were developed for sequential systems, and seem 
quite difficult to deploy on modern hardware. This is a ma- 
jor practical limitation. 

In this work, we develop an new approach to NN search 
that is both provably sensitive to intrinsic structure and 
that is effective on modern CPUs and CPUs. Our algo- 
rithms are based on fundamental ideas from metric simi- 
larity search, and are designed carefully for two important 
goals. First, our design choices allow us to rigorously bound 
the search complexity of our methods. Equally important, 
these choices allow us to factorize our algorithms into a basic 
primitive that is simple to parallelize, making them effective 
and relatively easy to implement on different systems. 

Our methods are of immediate practical use. We demon- 
strate their performance benefits on a range of modern plat- 
forms: a 48-core server machine, a GPU, and a multicore 
desktop. 



2. BACKGROUND AND RELATED WORK 

The focus of the present work is unique; as far as we know, 
it is the only work to simultaneously make use of modern 
algorithmic developments and modern hardware. Still, the 
ideas behind our methods are based on a substantial amount 
of previous work: the algorithmic ideas are based on tech- 
niques from metric similarity search; the results on intrinsic 
dimensionality are related to ideas developed mostly in the 
theory and data analysis communities; and the motivation 
behind the project comes from recent trends in computer 
hardware, especially as related to databases and scientific 
computation. 

The data structure and algorithm in this work are based 
on two fundamentals of algorithms for metric data: space 
decomposition and the triangle inequality. These pillars are 
used in virtually all work on metric NN search; see the sur- 
veys of Chavez et al. and Clarkson for detailed overviews 

fl [To] . Two of the most empirically effective structures are 
ESA [29] and metric ball trees |23[|31| , both of which have 
spawned many relatives. 

A long-standing problem in similarity search is the diffi- 
culty of dealing with high-dimensional data; see [l] [s] [SO] 
and the above surveys. The basic challenge is that space- 
decomposition structures that reduce the work for NN re- 
trieval seem to have performance that scales exponentially 
with the dimensionality of the data, rendering them useless 
to all but the smallest problems. Within the last two decades 
there have been two very promising directions of work that 
attempt to deal with the problem of high-dimensional data. 

The first is called Locality Sensitive Hashing (LSH) |16| . 
LSH has retrieval performance that is provably sublinear, 
independent of the underlying dimensionality. This was a 
major theoretical breakthrough and the data structure has 
been successfully deployed on some tasks (e.g. [2]). How- 
ever, LSH has some limitations: it can only provide ap- 
proximate answers, it is defined only for particular distance 
measures (not at the generality of metrics), and setting the 
parameters correctly can be complex 

The second line of work, upon which we build, is based 
on the notion of intrinsic dimensionality. The basic idea 
here is that many data sets only appear high-dimensional, 
but are actually governed by a small number of parameters. 
Within data analysis and machine learning, the idea of low- 
dimensional intrinsic structure has become extremely pop- 
ular and such structure is believed to be common in many 
data sets of interests [26[ |27| . 

This idea has also been explored in the context of NN 
search. A variety of slightly different notions of metric space 
dimensionality capture this intuition. One which has re- 
cently resulted in strong theoretical and empirical results 
is the growth dimension or expansion rate [9] [iS], which 
we define formally later. This notion of dimension lead to 
the development of the Cover Tree [5] [20] , which we return 
to momentarily. Though the notion has some idiosyncrasies 
|20| , the impressive empirical performance of the Cover Tree 
suggests that it is a useful notion. 

Perhaps the two most relevant methods for NN search are 
the Cover Tree and the GNAT of Brin [sj; let us distinguish 
this research from the present work. The GNAT uses a 
simple space decomposition based on representatives from 
the database, much as we do, and also discusses the idea 
of intrinsic dimensionality. However, the relationship of the 
gnat's search performance and the intrinsic dimensionality 



is only discussed in an informal, heuristic way, whereas we 
give rigorous runtime guarantees. These rigorous bounds 
require a search algorithm that is different than that of the 
GNAT. Additionally, parallelization is not discussed in [s]. 

The Cover Tree has rigorous guarantees on the query time 
dependent on the expansion rate and has empirical perfor- 
mance that is state of the art. Even so, our algorithms, data 
structure, and theory are substantially different; in partic- 
ular, the Cover Tree is a deep tree and is explored in a 
conditional way that seems quite difficult to parallelize. We 
compare our method with the Cover Tree in the experiments. 

Lastly, we touch on the major inspiration for this paper: 
the use of hardware to accelerate data-intensive processes. 
Impelled by the sudden ubiquity of multicore CPUs and the 
development of CPUs for general-purpose computation, this 
area of research has exploded in the last decade; let us pro- 
vide a couple of inspiring examples. A relatively early work 
develops methods to off-load expensive database operations 
onto the GPU 



15 



A very recent piece of work tunes basic 
tree search algorithms (such as for index lookup) to be ef- 
fective on modern multicore CPUs and GPUs [l9]. Finally, 
another paper suggests simply running brute force search on 
a GPU to accelerate NN search \5'; this simple approach pro- 
vides a surprising amount of acceleration over computation 
on sequential CPUs |14| . 

3. THE BRUTE FORCE PRIMITIVE 

Brute force search requires a high amount of work, but 
parallelizes effectively. In contrast, most accelerated ap- 
proaches to NN search reduce the work, but seem difficult to 
parallelize. Our approach to NN search takes the advantages 
of both: it reduces the work and parallelizes effectively. We 
achieve this combination by structuring our algorithms as 
multiple (actually two) brute force searches, each of which 
considers only a small portion of the database. In this sec- 
tion, we formalize the brute force primitive and discuss the 
parallelization of it. 

Given a set of queries Q and a database X with n ele- 
ments, finding the NNs for all 5 G Q can be achieved by a 
series of linear scans. For each query q, the distance between 
q and each x £ X is computed, the distances are compared, 
and the database point that is closest is returned. We de- 
note this subroutine as BF(Q,X). If L is some sets of IDs 
(i.e. 1/ C {1, . . . , n}), then brute force search between Q and 
this subset of the database is denoted JiF(Q, X[L]). 

The work required for BF(Q,Jf) is 0{n); we later prove 
that our search algorithms have work only roughly O(-yn), 
which is performed in two brute force calls BF((5,X[Li]) 
and BF(Q,X[L2]), where lists Li and L2 are determined 
by our algorithm. 

Parallelizing JiF{Q, X) is straightforward. We break the 
procedure down into two steps: a distance computation step, 
and a comparison step. In the distance computation step, all 
pairs of distances are computed. This has virtually the same 
structure as matrix-matrix multiply, and hence block decom- 
position approaches are effective. In the case where there is 
only a single query presented at a time (e.g. a stream of 
queries), the distance computation step of BF(g, X) has the 
structure of a matrix-vector multiplication. In both cases, 
the parallelization is extremely well-studied. 

The second step is the comparison: for each query, the 
distances must be compared, and the nearest database el- 
ement returned. This is simple to do in parallel systems 



as well; the problem can simply be plugged into the stan- 
dard parallel-reduce paradigm where comparisons are made 
according to an inverted binary tree. 

The computational structure of accelerated NN data struc- 
tures is quite different. For simplicity, let us take metric 
trees as an example i23|. Querying this data structure re- 
quires a depth- first search of a deep tree. This process in- 
volves an interleaved series of distance computations, bound 
computations, and distance comparisons. Moreover, the 
computation is conditional: the specific calculations made 
at one step depend on the comparisons from the previous 
step. Distributing the work for this process across proces- 
sors is a significant challenge. Additionally, because of the 
interleaved nature of the computation, attaining near full- 
utilization of the hardware is difficult; in contrast, matrix- 
matrix multiply is commonly used to demonstrate the ca- 
pability of processors. Finally, the conditional nature of the 
computation makes execution on vector hardware with lim- 
ited branching abilities — namely, GPUs — inefficient. 

Efficient NN routines seem to depend on a complex com- 
putational structure; a major contribution of the present 
work is in demonstrating that a much simpler structure can 
be substituted without significant loss. 

With the brute force primitive in place, we proceed to 
discuss our data structure and algorithms, all of which will 
be built from this primitive. 

4. DATA STRUCTURE 

We discuss the data structure underlying our methods in 
this section, which we call the Random Ball Cover (RBC). 
This is a very simple, single-level cover of the underlying 
metric space. The basic idea is to use a random subset of 
the database as representatives for the rest of the DB. The 
details of this structure differ slightly for our two different 
search algorithms, which will be introduced in the next sec- 
tion. 

The database is denoted X = {x\, . . . , x„} and the met- 
ric in use is denoted p(-,-)- The data structure consists of 
a random subset of the database, which will generally be 
of size about 0{^/n); we make this precise later. This set 
of random representatives will be denoted R. It is built by 
choosing each element of the database independently at ran- 
dom with probability rir/n, where the exact value of rir is 
discussed in the theory section. In expectation, there will 
be rir representatives chosen — one can think of the symbol 
Tl-r as shorthand for number of representatives. 

Each representative owns some subset of the database. 
The list of points that a representative r owns is denoted 
Lr- In the exact search algorithm, Lr contains all x £ X for 
which r is the nearest neighbor among R. In the one-shot 
algorithm, Lr contains a suitably sized set of r's NNs among 
X. We will at times refer to Lr as an ownership list. 

Along with each representative, a radius is also stored. 
This radius is defined as the distance from r to the furthest 
point that it owns: 

tpr ~ max p{x, r). 

xELr 

The focus of this paper is on algorithms that are simple 
to parallelize; this is achieved by folding the computational 
work into brute force calls. The building routines demon- 
strate this principle concisely. The building routine for the 
exact search algorithm must find the NN for each x £ X 



among the representatives R. Thus this routine is simply 
a call to BF(X, _R). Similarly, the building routine for the 
one-shot algorithm must find the NN(s) for each representa- 
tive R among the database elements X; thus this procedure 
is simply a call to BF(_R, X). Both parallelize easily. 

With the data structure and notation in place, we proceed 
to describe the search algorithms. 

5. SEARCH ALGORITHMS 

In this section, we describe two search algorithms which 
use the RBC data structure. The theoretical analysis of 
these algorithms appears in the next section. Both of these 
algorithms build up from the brute force search primitive. 

We first describe the one-shot algorithm, then the exact 
search algorithm. The one-shot algorithm is extraordinar- 
ily simple, the exact algorithm only slightly less so. Both 
algorithms rely on a randomized data structure, and so are 
probabilistic. In the one-shot algorithm, the solution itself 
is randomized: the data structure returns a correct result 
with high probability. In the exact algorithm, the solution 
is guaranteed to be correct; only the running time is proba- 
bilistic!^ Hence when an exact answer is required, the exact 
algorithm is appropriate; if a small amount of error can be 
tolerated, the one-shot algorithm is simpler and often faster, 
as we show in the experiments. 

We focus on the problem of 1-NN search throughout; the 
extensions to fc-NN and e-range search are straightforward. 

5.1 One-shot search 

First, recall the data structure built for the one-shot search 
algorithm. The representatives are chosen at random, and 
each list Lr contains the s closest database elements to r. 
Depending on the setting of s, points will typically belong to 
more than one representative. We discuss these parameters 
further in the theory section. 

On a query q, the algorithm proceeds as follows. It first 
computes the NN to q among the representatives using a 
simple linear scan (brute force search), call it r. It then 
scans the ownership list Lr, computing the distance from q 
to each listed database point. The nearest one is returned 
as the nearest neighbor. 

We restate the algorithm in terms of the brute force prim- 
itive. The algorithm first calls BF(g,i?), which returns a 
representative r. It then calls JiF{q, X[Lr]) and returns the 
answer. 

The one-shot algorithm is almost absurdly simple. Yet, as 
we show rigorously in the theory section, it provides a mas- 
sive speedup; in particular, it reduces the work from 0(n) 
(required for a full brute force search) to roughly 0(^/n). 
Moreover, it is very fast empirically, as we show in the ex- 
periments section. 

5.2 Exact search 

Whereas the one-shot algorithm does not use the triangle 
inequality (though the analysis requires it), the exact search 
algorithm explicitly prunes portions of the space using it; 
in this sense, it is reminiscent of classic branch-and-bound 
techniques. 

Recall that the data structure built for the exact search 
algorithm is slightly different from the one for the one search 

^This algorithm can be easily modified so that it only 
guarantees an approximate nearest neighbor, which reduces 
search time. 



algorithm (the reason for the difference will become clear in 
theory section). The build algorithm calls BF(X, 7?), then 
adds each x £ X to the ownership list of its returned NN in 
R. The radii tp,- are set to max^jgL^ p{x,r) as before. 

We now detail the search algorithm. First, the closest 
point to q among all r G -R is computed; call it r^, and 
let 7 = p{q,rq). This distance is an upper bound on the 
distance to q's NN (since £ X), and so the algorithm can 
use it to discard some of the database from consideration. 
Recall that the radius of each representative r is stored as 
tpr — i.e. each x £ Lr satisfies p{x,r) < ^r. Because 7 is an 
upper bound on the distance to the NN, any point belonging 
to an r satisfying 



p{q,r) > 7 + Vr 



(1) 



can be discarded. The following sketch illustrates Q; since 
there is a point within distance 7 of q, no points within 
distance -f/j^ of r can be g's NN. 




Hence the only points that need to be considered belong to 
a list Lr, where r violates ([T]). 

The algorithm simultaneously checks a second bound in 
hopes of pruning out more of the representatives. As Lemma 
[1] (see the next section) shows, if is g's NN among the 
representatives, any representative that owns g's NN must 
satisfy 



p{'l,r) < 3- p{q,rg). 



(2) 



Hence any representative violating this inequality is pruned 
out by the search algorithm. 

Once the pruning stage is complete, the search algorithm 
computes the distance to all points belonging to one of the 
remaining representatives, and returns the nearest. 

We restate the algorithm in terms of our primitive. It 
first computes BF(g,i?), much like the one-shot algorithm. 
In this case, however, the distances must be retained so that 
inequalities ([T| and ([2| can be checked. Once the inequali- 
ties are checked, some representatives will still remain, with 
lists L\, . . . , Lt. Next the search algorithm performs another 
brute force search, namely BF(g, X[Li U 1/2 U . . . U Li]), and 
returns the answer. 

We will see that the size of each of the brute force calls 
is about O(v'n), providing major time savings over a full 
brute force search. 

We emphasize that the computation structure of both 
search algorithms is quite different from tree-based search, in 
which bounds are incrementally refined, and distance com- 
putations are interleaved with bound evaluations. In both 
cases, this structure makes the algorithm extremely simple 
to implement and effective to parallelize. It is rather sur- 
prising that such simple algorithms can effectively reduce 
the work required for NN search, but that is exactly what 
we show both theoretically and empirically in the following 
sections. 



6. THEORY 

As we described in the background section, all methods 
that reduce the work for NN search have some dependence 
on the dimensionality of the database. Much of the success 
of metric indexing methods is commonly ascribed to their 
dependence only on the intrinsic dimensionality of data. In 
this section, we prove that the RBC search algorithms scale 
with the expansion rate, which is a useful notion of intrinsic 
dimensionality. 

Definition 1. Let B(x, r) denote the closed ball of radius 
r around x — i.e. the set {y : p{x,y) < r}. A finite metric 
space M has expansion rate c if for all r > and x £ M 

\B{x,2r)\ < c- \B{x,r)\. 

To gain some intuition for this measure, consider a grid of 
points in R'* under the £1 metric 



The expansion rate in this case is 2 



to the dimensionality of the data 18 



hence log c corresponds 
Notice that the ex- 



pansion rate is defined only in terms of the metric, not in 
terms of the representation of data; in this sense, the rate 
captures the intrinsic structure of the metric space. Notice 
also that the expansion rate is defined for arbitrary met- 
ric spaces, so makes sense for the edit distance on strings 
and the shortest path distance on the nodes of a graph, for 
example. 

Throughout we assume that X U Q has expansion rate c, 
and we prove bounds dependent on this expansion rate and 
n. 

The exact search algorithm and analysis rely on the fol- 
lowing lemma, which is known lO]. See Appendix|A]for the 
proof. 

Lemma 1. Let R <Z X and assign each x £ X to its near- 
est r £ R. Let 7 = minrgi? p(5, r) (i.e., 7 is the distance 
to q's NN in R). Then, if some r* £ R owns the nearest 
neighbor to q in X , it must satisfy 

p{q,r*) < 37. 



We now analyze the search algorithms. Throughout, we 
assume ii is a random subset of X, built by picking each ele- 
ment of X independently at random with probability Ur/n. 
Sometimes we will denote this probability as p. Recall that 
the ownership list of r G _R is denoted Lr and the radius 
of this list (i.e. max^^gz,^ p(a;, r)) is denoted tpr- Finally, 
Ur is the expected number of representatives and n is the 
cardinality of the database. 

6.1 Exact Search 

First, let us consider the exact search algorithm. The 
search algorithm performs two steps: in the first step, the 
algorithm performs brute force search from the queries to R; 
and in the second step, it performs a brute force search from 
the queries to the database elements belonging to ownership 
lists of un-pruned representatives. The first step clearly has 
work complexity 0{nr) per query, where Ur is the expected 
number of representatives; the following analysis bounds 
the complexity of the second step. In particular, we show 



that the expected number of distance evaluations is c'^n/ur. 
Hence if rir ~ c^^^y/n, the expected number of distance eval- 
uations in the second step is 0{c^^^ ^/n), the same as the 
first step. We call rir = 0{c^^'^ y/n) the standard parameter 
setting. 

In the first step of the algorithm, the nearest point to q 
in 7? is found; call this point r,. How many database points 
are likely to be closer to q than r^? 

Claim 1. Let 7 be the distance from q to its nearest neigh- 
bor in R, Tq. The expected number of points in B{q,'y) is 
n/ur, which is 0{_^pnl (?^'^') for the standard parameter set- 
ting. 

Proof. Form a list L = [xi, X2, . . . , Xn] by ordering the 
database points x € X by their distance to q. Some subset 
of X also belongs to R; let xt be the first representative 
appearing in L (i.e. the closest representative to q). Then 
the expected number of points in B{q, 7) is equal to t — 1. 

A slightly different way to view the process is that the L is 
fixed, then xi is chosen as a representative with probability 
Ur/n, then X2 is chosen as a representative with probability 
Ur/n, and so on. We wish to know the expected time before 
the first Xi is chosen as a representative. That is given pre- 
cisely by the geometric distribution: the number of Bernoulli 
trials needed to get one success. The mean of a Bernoulli 
distribution with parameter p = Ur/n is 1/p = n/ur. Hence 
E|_B(g, 7)1 = n/ur, which is 0{y'ri/c^^^) for the standard 
parameter setting. □ 

We note that a high-probability version of the above claim 
follows easily from standard concentration bounds. We also 
point out that the expectation is over randomness in the al- 
gorithm; we are not making any distributional assumptions 
on the database. 

After computing the nearest neighbor to the query q among 
the representatives, the exact search algorithm uses 7 (= 
p{q,rq)) as an upper bound on the distance to g's NN to 
prune out some representative sets. In particular, any rep- 
resentative r with radius satisfying 

P{<l,r)>'y + ipr (3) 

cannot possibly own q's NN. Additionally, the algorithm can 
safely prune out any representative r such that 

p(g,r)> 37. (4) 

This property follows from Lemma [l] In the following we 
only work with inequality Q . The simultaneous use of both 
inequalities improved the empirical performance, but it is 
not necessary for the following theory. 

We now estimate how many representatives violate Q 
and bound how many points these representatives own. We 
show that all examined database points belong to a ball 
B{q, 77), which we subsequently bound the cardinality of. 

Claim 2. The nearest neighbor of q lies inside of the ball 
B{qJl). 

Proof. Clearly, each representative r violating (j4| lies 
inside of B(q, 37), and the NN of q will appear on one of the 
lists Lr. If the algorithm examined every x £ Lr, it would 
be guaranteed to find the nearest neighbor, but we cannot 
say how many total points it will examine. However, the 
algorithm does not need to examine the entire list, as we 
now show. 



Suppose that x is g's NN; what is the maximum distance 
it can be from its representative r? From the triangle in- 
equality, p{x,r) < p{x,q) -\- p{q,r). But since x is q's NN, 
and since r £ X, p{x, q) < 7. The other term is bounded by 
37 on account of Hence the nearest neighbor x must lie 
within 47 of its representative. 

Since any considered representative r satisfies p{q, r) < 37 
(by Q) and the NN a; of g is within 47 of r, the triangle 
inequality implies that p{q, x) < 77. □ 

We have shown that the search algorithm only needs to 
compute distances from q to points x which are within dis- 
tance 47 of their representative. Hence, if the lists Lr are 
stored in sorted order according to the distance to r, the 
search algorithm can simply ignore all points x more than 
distance 47 from their representative]^ 

Finally, we bound the expected number of examined points. 
From Claim[2] all examined points lie in B{q, 77). Applying 
the expansion rate condition, we have that 

\Biq,7j)\ < |B(g,87)| <c^lB(g,7)!. (5) 

From Claim [TJ ¥,\B{q,'y)\ — Ur/n, which we can plug into 
([5|. As each x only appears on one list Lr, each x is only 
compared to q once, implying that ([5| bounds not only the 
cardinality of the examined set points, but also the number 
of computations (in the second step). 

Putting everything together, we have the following theo- 
rem. 

Thereom 1. The expected number of points examined in 
the second stage of the exact search algorithm is c^n/ur, 
which is 0(<?^'^ yfn) for the standard parameter setting. 

Since the time for the first brute force step was alsoO(c^/2V^), 
we have shown that the expected runtime of the exact search 
algorithm is 0(c^/^i/n). 

6.2 One-Shot Search 

The one-shot search algorithm is considerably simpler than 
the exact search algorithm, and also uses a slightly different 
data structure configuration. In particular, the algorithm 
searches only a single representative list per query, and the 
ownership lists of the RBC will usually overlap. Unlike the 
exact search algorithm, the one-shot algorithm only returns 
the NN with high probability, similar to locality sensitive 
hashing ^16^ . 

With these differences in mind, the resulting time com- 
plexity bound is actually quite similar to the bound in The- 
orem [l] Recall that there are two parameters governing its 
run time: n,-, the number of representatives; and s, the num- 
ber of points assigned to each representative. Hence the time 
complexity of the one-shot search algorithm is 0{nr -\- s). 
The following theorem describes the setting of these param- 
eters to guarantee a high probability of success. 

Thereom 2. Set the parameters 

Ur = s — c\fn ■ Y^ln ^. 

Then the one-shot algorithm returns the correct NN with 
probability at least 1 — 5. 

^For the purpose of scheduling on some systems, it may be 
advantageous to compute how much of each list Lr must 
be explored before the second brute force operation begins. 
This can be done in time O(logn) per list. 



Name 


Num pts 


Dim 


Bio 


200k 


74 


Covertype 


500k 


54 


Physics 


100k 


78 


Robot 


2M 


21 


Tinylm 


lOM 


4-32 



Table 1: Overview of data sets. 



The proof is in Appendix [B] 

Hence the time complexity of the one-shot algorithm is 
0{c^/n) times a factor dependent on the desired success rate. 
Notice that the dependence on c is lower for the one-shot 
algorithm than the exact algorithm; this reduced complexity 
seems to be reflected in the experiments. 



bio GOV phy robot tiny4 tinyS 1iny16 tiny32 

Figure 2: Speedup of exact search over brute force. 



7. EXPERIMENTS 

We perform several sets of experiments to demonstrate the 
effectiveness of our methods. The first, and probably most 
important, set of experiments demonstrate the performance 
benefit of the RBC on a 48-core machine, as compared to a 
brute force implementation (j ^7.2[ ). These experiments show 
that the RBC significantly reduces the work required for NN 
search and that it parallelizes effectively. 

The second set of experiments demonstrates that the RBC 
is effective on graphics hardware (S7.3I. It is challenging to 
deploy data structures on such hardware, but very important 
because of the ubiquity of CPUs in scientific and database 
systems. 

The final set of experiments compares the performance of 
the RBC to the Cover Tree on a desktop machine ( |7.4[ ). 
These experiments demonstrate that the exact search al- 
gorithm is competitive with the state-of-the-art even on a 
machine with a low degree of parallelism. 

7.1 Experimental setup 

Our CPU implementation of the RBC was written in C 
and parallelized with OpenMP. Our GPU implementation 
was written in C and CUDA. Both are available for down- 
load from the author's website. 

In very low-dimensional spaces, basic data structures like 
kd-trees are extremely effective, hence the challenging cases 
are data that is somewhat higher dimensional. We experi- 
ment on several different data sets over a range of dimen- 
sionalities. Table [T] provides a quick overview; we describe a 
few details next. 

The Bio, Covertype, and Physics data sets are standard 
benchmark data sets used in machine learning and are avail- 
able from the UCI repository 
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They have been used 
to benchmark NN search previously [2| [7| [25] . The Robot 
data was generated from a Barret WAM robotic arm; see 
[22| . The Tinylm data set is taken from the Tiny Images 
database, which is used for computer vision research [28| . 
We took the image descriptors and reduced the dimension- 
ality using the method of random projections|^ We experi- 
mented with dimensionalities of 4, 8, 16, 32. For all exper- 
iments, we measured distance with the ^2-norm (i.e. stan- 
dard Euclidean distance), which is appropriate for this data. 



^This dimensionality reduction technique approximately 
preserves the lengths of vector s, a nd hence is a useful prepro- 
cessor for NN search; see e.g. pTl. The techniqu e is formallv 
justified by the Johnson-Lindenstrauss Lemma 17 . 



We perform the first set of CPU experiments on a 48- 
core/4-chip AMD server machine. Each chip is a 12-core 
AMD 6176SE processor, and is divided into two 6-core seg- 
ments. This machine has a high core count, so it is a good 
system to test the scalability of our algorithms. 

We compare to the Cover Tree on a quad-core Intel Core 
15 machine, which is a reasonable representative for a mid- 
range desktop. 

Our GPU experiments are run on a NVIDIA Tesla c2050 
graphics card. This card is well-suited to general and sci- 
entific computation as it has somewhat more memory than 
standard graphics cards (3GB), and provides general-purpose 
architectural features such as error-correcting memory and 
caching!^ We previously described the details of our GPU 
implementation in a workshop paper [6]. 

7.2 48-core experiments 

We compare the performance of our methods to brute 
force search on the 48-core machine. As far as we know, 
there is no readily available accelerated NN method for such 
a machine. Furthermore, brute force is already quite fast 
because of the raw computational power. 

First, we look at the performance of the exact search algo- 
rithm, which is guaranteed to return the exact NN. Figure[2] 
shows the results. We are getting a strong speedup of up to 
two orders of magnitude, despite the challenging hardware 
setting and the reasonably high dimensionality of the data[^ 

Next we consider the one-shot search algorithm. As de- 
veloped in the theory section, we set rir (the number of rep- 
resentatives) and s (the number of points owned per repre- 
sentative) equal to one another. The parameter allows one 
to trade-off between the quality of the solution and time 
required; we scan over this parameter to show the trade- 
off. This algorithm is not guaranteed to return a nearest 
neighbor, so we must evaluate the quality of the solution. 
A standard error measure is the rank of the returned point: 
i.e., the number of database points closer to the query than 
the returned point [25]. A rank of denotes the exact NN, 
and rank of 1 denotes the second NN, and so on. 

*On the other hand, in limited experiments with (much 
cheaper) consumer-grade CPUs, we noticed that the run- 
times were not dramatically different. 

^There is one parameter in this algorithm; namely, the num- 
ber of representatives chosen. The retrieval times were not 
particularly sensitive to this choice; see Appendix [C] for a 
detailed examination of the parameter setting. 
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Figure 1: Results of the one-shot algorithm. This is a log-log plot of the speedup as a function of the error 
rate. The x-axis is logarithmic and runs from 10""' to 10"^ and signifies the average (over queries) rank of the 
returned result. For example, a rank of lO" indicates that on average the algorithm returns the 2nd NN. The 
j/-axis is also logarithmic and runs from lO" (no speedup) to 10* (lOOOOx speedup). 



Figure [T] shows the results. The speedup achieved in these 
experiments is quite significant; even with a rank around 
10~^ (very close to exact search), the worst speedup is an 
order of magnitude. In applications where a small amount 
of error is tolerable, the one-shot search algorithm can pro- 
vide a massive speedup, even better than our exact search 
algorithm in some cases. In many applications, e.g. in data 
mining, there is considerable uncertainty associated with the 
data, so a small amount of error in the NN retrieval is not 
important. 

7.3 GPU experiments 

GPUs have impressive brute force search performance [14| . 
However, the GPU architecture makes efficient data struc- 
ture design quite difficult. In particular, GPUs are vector- 
style processors with limited branching ability; hence con- 
ditional computation typically seriously under-utilizes these 
devices. 

We show that our RBC one-shot algorithm provides a sub- 
stantial speedup over the already-fast brute force search on a 
GPU. Table[2]shows the results. We show only the speedups, 
as the error rate is the same as that of the CPU experiments. 
The parameter was set to achieve an error rate of roughly 
10~^ (refer back to Figure [T]). Our method is clearly very 
effective in this setting; despite the challenging hardware de- 
sign, it provides a one-to-two order of magnitude speedup 
on all datasets. 

7.4 Cover Tree comparison 

Finally, we investigate the performance of the RBC on 
a quad-core desktop machine. We compare to the Cover 
Tree, which has state-of-the-art empirical performance on 
a sequential machine, and which was developed under the 
same notion of intrinsic dimensionality as the RBC. 

This is a challenging comparison for the RBC, as the sys- 



Data Speedup 

^io 38l 

Covertype 94.6 

Physics 19.0 

Robot 53.2 

Tinylm4 188.4 

Table 2: GPU results: speedup of the one-shot al- 
gorithm over brute force search (both on the GPU). 

tem does not require the extreme constraints under which 
the RBC was designed. In particular, this system can branch 
effectively and has a low core count. Hence such a system 
can handle an algorithm with a much more complex com- 
putational structure, like the Cover Tree search algorithm. 
Suprisingly, the performance of the RBC is at or near state- 
of-the-art even in this setting. 

The available implementation of the Cover Tree is single- 
core [2], and a naive parallelization would not significantly 
benefit it. In particular, one could split the database into 
p chunks (one for each core), run p Cover Tree searches in 
parallel, and then perform a p-way reduce. However, since 
the dependence of the Cover Tree on the number of data 
points is only O(logn), doing so would only decrease the 
runtime from 0{c^' logn) to, at best, 0{c^ log ^), which is a 
very minor improvement. Hence we run the Cover Tree only 
on one core, but allow the RBC to use the whole machine. 

Table [s] shows the results|^ The RBC is competitive on 
all of the datasets, and significantly outperforms the Cover 
Tree on the three largest datasets. Again, these results are 
surprising, as our exact search algorithm is much simpler 

®We were unable to get the Cover Tree software to run on 
the full Tinylm data sets, so we reduced the database size 
to IM. 



Data 


Cover Tree 


RBC 


Bio 


18.9 


6.4 


Covertype 


0.4 


1.1 


Physics 


1.9 


1.7 


Robot 


4.6 


5.1 


Tiny4 


0.5 


1.2 


TinyS 


14.6 


3.3 


Tinyie 


178.9 


25.1 


Tiny32 


387.0 


67.9 



Table 3: Comparison of the Cover Tree and the ex- 
act RBC algorithm on a quad-core desktop machine. 
Times shown are the total query time in seconds for 
10k queries. 

than the Cover Tree search algorithm, and since our methods 
have the additional (significant) constraint that they must 
work on highly parallel systems. 

We note that the RBC has a significantly lower theoret- 
ical dependence on the dimensionality than the Cover Tree 
{0{(?^^) vs 0{c?)). This is refiected in the experiments; the 
two datasets that the Cover Tree significantly outperforms 
the RBC on are very low-dimensional: the Tiny4 data set is 
four-dimensional, and the Covertype dataset has low intrin- 
sic dimensionality [2]. This reduced dependence on dimen- 
sionality appears to be another advantage of the RBC and 
would be interesting to explore in future work. 

8. CONCLUSION 

In this paper, we introduced techniques for metric simi- 
larity search on parallel systems. In particular, we demon- 
strated that the RBC search algorithms significantly reduce 
the work required for NN retrieval, while being structured 
in such a way that can be easily implemented on parallel 
systems. Our experiments show that these techniques are 
practical on a range of modern hardware. The theory behind 
the RBC shows that the data structure is broadly effective. 

Our code is available for download. These implementa- 
tions supply additional low-level details on implementing the 
RBC. Moreover, they are practical tools for many NN search 
problems. 

An interesting direction for future work is to explore the 
performance of the RBC in a distributed or multi-CPU en- 
vironment. The RBC data structure suggests a simple dis- 
tribution of the database according to the representatives 
that could be quite effective in such environments. There 
are many interesting details for study here, such as I/O 
and communication costs, and the connection to distributed 
programming paradigms. Furthermore, a distributed imple- 
mentation would be broadly useful in practice. 
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APPENDIX 

A. PROOF OF LEMMA 

We restate Lemma [l] and prove it. 

Lemma 2. Let R C X and assign each x £ X to its near- 
est r £ R. Let 7 — miurgi? r) (i.e., 7 is the distance 
to q's NN in R). Then, if some r* £ R owns the nearest 
neighbor to q in X, it must satisfy 

p{q,r*) < 37. 

Proof. Suppose that x is q's NN in X — i.e. p{q, x) < 
p{q, y) for all y G X — and that r* owns x (r* is x's NN 
among R). Furthermore, let r be q's NN in R. Since R C X 
and p{q, r) = 7, 7 gives an upper bound on the distance to 
q's NN; hence p{q, x) < 7. Using this bound along with the 
triangle inequality gives p{x,r) < 27, but since p{x,r*) < 
p{x, r), we have p{x, r*) < 27 as well. Applying the triangle 
inequality to the bounds on p{x,r*) and p{q,x) yields the 
lemma. □ 



Proof. Let a; G B{q,y). Thenp(a;,r) < p{x,q)+p{q,r) < 
27, proving the first inequality. For the second, let x £ 
B{r, 27) . Then p{x, q) < p{x, r) + p{r, q) < 27 4- 7 < 47. □ 

Now we restate and prove Theorem [2] 
Thereom 3. 5*6* the parameters 




Then the one-shot algorithm returns the correct NN with 
probability at least 1 — 5. 

Proof. If a query q lies within distance tpr/2 of a repre- 
sentative r, then its nearest neighbor Xq is guaranteed to be 
in L,.. This follows simply from the triangle inequality: 

p{r,Xq) < p{r,q) + p{q,Xg) 

< p{r, q) + p{q, r) (since Xq is q's NN) 

Thus the algorithm fails only if p{q, r) > V'r/2 for all r (and 
in particular that nearest one). We bound the probability 
of this occurring. 

Let 7 = p(q,r), where r is q's NN. Again, assume that 
7 > ■4)r/2. We have that Bir, ip,-) C B{r, 27), and B{r, 27) C 
B{q, 47) by Lemma [s] Hence we can apply the expansion 
condition to get that 

|B(r,VOI < |B(g, 47)1 <c2lB(g, 7)1- 

In particular, \B{q,'y)\ > l/c^|_B(r, i/)r)|; of course B{r,ipr) 
contains the points of Lr, which we chose to be s. Thus 
there are at least s/(? points closer to q than r. What is 
the probability that none of them were chosen as represen- 
tatives? 

Each point is chosen with probability n^/n, so the prob- 
ability that none are chosen is 



which follows from the inequality (1 — t/rY < e *. If we 
plug in Tir = s = ^Jr\(?n, we get a probability of failure 
bounded by 

The theorem follows by setting S = e~^ and rearranging. □ 

C. EXACT SEARCH EXPERIMENTS 

There is a single parameter to set for the exact search al- 
gorithm (the number of representatives). Here we report the 
results over a fairly wide range of parameters to support the 
experimental results. Note that the search time is relatively 
stable to this setting. See Figure |3] for the results. 



B. PROOF OF THE ONE-SHOT THEOREM 

We will use the following lemma, which is well-known. We 
include a proof for completeness. 

Lemma 3. Suppose that p{q,r) = y. Then 



B(g,7)cB(r,27)cB(g,47). 
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Figure 3: Results for exact search. The y-axis is a logarithmic scale that varies from 10° (no speedup) to 10^ 
(lOOOx speedup). The x-axis is the number of representatives chosen. 



